This guide describes common Longhorn PVC mount failure scenarios in Kubernetes and the steps to resolve them.

Problem with multipathd service

In some cases, Longhorn fails to mount Persistent Volume Claims (PVCs) to pods in a Kubernetes cluster. This issue is typically caused by a conflict with the multipathd service, which may mistakenly identify Longhorn volumes as being in use, preventing the filesystem from being created.

The multipathd service is responsible for managing multiple paths to the same storage device. When it incorrectly identifies a Longhorn volume as being in use, it blocks the filesystem creation process, resulting in mount failures.

You might encounter the following error message in your Kubernetes environment:

Error Message:

Warning  FailedMount  12s (x6 over 28s)  kubelet  
MountVolume.MountDevice failed for volume "pvc-87285c92-26c4-40bd-842d-7f608d9db2d8": 
rpc error: code = Internal desc = format of disk "/dev/longhorn/pvc-87285c92-26c4-40bd-842d-7f608d9db2d8" failed:
type: ("ext4")
target: ("/var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/1e70ad7ff7c1222b1d656429fcc03679fdfa8ed3d9ae0739e656b2e161bfc08d/globalmount")
options: ("defaults")
errcode: (exit status 1)
output: (
  mke2fs 1.46.4 (18-Aug-2021)

  /dev/longhorn/pvc-87285c92-26c4-40bd-842d-7f608d9db2d8 is apparently in use by the system; will not make a filesystem here!
)
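Before applying the fix, you can confirm the conflict. A minimal sketch: pull the device path out of the kubelet error above so you can inspect it on the node (the `grep` pattern is an assumption based on Longhorn's `/dev/longhorn/<pvc-name>` device naming).

```shell
# Extract the blocked device path from the FailedMount message shown above.
err='format of disk "/dev/longhorn/pvc-87285c92-26c4-40bd-842d-7f608d9db2d8" failed'
dev=$(echo "$err" | grep -oE '/dev/longhorn/pvc-[0-9a-f-]+')
echo "$dev"
```

On the affected node, `multipath -ll` lists every device multipathd has claimed; if the SCSI device backing this Longhorn volume appears there, the blacklist fix below applies.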

Solution

Follow these steps to resolve the issue:

Step 1: Edit the multipath.conf File

  1. Open the multipath.conf file for editing:
    vi /etc/multipath.conf
  2. Add the blacklist configuration.
    • Add the following configuration to the multipath.conf file on all nodes in the cluster:
      blacklist {
          devnode "^sd[a-z0-9]+"
      }
    • After adding the configuration, the file should look like this:
      defaults {
          user_friendly_names yes
      }
      blacklist {
          devnode "^sd[a-z0-9]+"
      }
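Longhorn volumes attach to nodes as SCSI block devices (sda, sdb1, ...), which is why the blacklist targets that name pattern. A quick sketch to sanity-check the regex against a few example device names:

```shell
# Verify that the blacklist pattern matches SCSI device names
# like those Longhorn creates (sample names, not live devices).
regex='^sd[a-z0-9]+'
for dev in sda sdb1 sdc; do
  echo "$dev" | grep -Eq "$regex" && echo "$dev: blacklisted"
done
```

Note that this pattern excludes every sd* device from multipath management; if a node also relies on multipathd for other SCSI storage, you may need to scope the blacklist more narrowly.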

Step 2: Restart the multipathd.service

After updating the multipath.conf file, restart the multipathd service on all nodes in the cluster:

systemctl restart multipathd.service

Step 3: Delete and Recreate the Affected Pods

To apply the changes and resolve the issue, delete the affected pods so that Kubernetes can recreate them with the corrected configuration:

kubectl delete pod nextgen-gw-0 nextgen-gw-redis-master-0

Problem with Longhorn filesystem corruption

  • When a Longhorn volume's filesystem is corrupted, Longhorn cannot remount the volume, and the workload that uses it fails to restart.
  • Longhorn cannot repair the filesystem automatically; you must fix the corruption manually.
    You might encounter the following error message in your Kubernetes environment:

    Error Message:
    Events: 
      Type     Reason       Age                  From     Message 
      ----     ------       ----                 ----     ------- 
      Warning  FailedMount  56s (x5809 over 8d)  kubelet  MountVolume.MountDevice failed for volume "pvc-b3ca140a-dab9-49f6-9f39-063594e58521" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/longhorn/pvc-b3ca140a-dab9-49f6-9f39-063594e58521 but could not correct them: fsck from util-linux 2.39.3 
    /dev/longhorn/pvc-b3ca140a-dab9-49f6-9f39-063594e58521 contains a file system with errors, check forced. 
    /dev/longhorn/pvc-b3ca140a-dab9-49f6-9f39-063594e58521: Unattached inode 1555 

Solution

Follow these steps to resolve the issue:

Step 1: Identify the Node Running the Pod

Run the following command to find the node where the gateway pod is running:

kubectl get pods -o wide 

Sample Response:

root@opsramp-gateway:/home/gateway-admin# kubectl get pods -o wide 
NAME                        READY   STATUS              RESTARTS   AGE   IP           NODE              NOMINATED NODE   READINESS GATES 
nextgen-gw-0                0/3     ContainerCreating   0          12m   10.42.0.31   opsramp-gateway   <none>           <none> 
nextgen-gw-redis-master-0   1/1     Running             0          25m   10.42.0.29   opsramp-gateway   <none>           <none> 

From this output, we see that the gateway pod is running on the opsramp-gateway node.
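If you prefer to extract the node programmatically, here is a minimal sketch that parses `kubectl get pods -o wide` output; it is fed the sample output above (trimmed to the first seven columns) rather than a live cluster.

```shell
# Pull the hosting node for a pod out of `kubectl get pods -o wide` output.
out='NAME                        READY   STATUS              RESTARTS   AGE   IP           NODE
nextgen-gw-0                0/3     ContainerCreating   0          12m   10.42.0.31   opsramp-gateway
nextgen-gw-redis-master-0   1/1     Running             0          25m   10.42.0.29   opsramp-gateway'
node=$(echo "$out" | awk '$1 == "nextgen-gw-0" {print $7}')
echo "$node"
```

Against a live cluster, `kubectl get pod nextgen-gw-0 -o jsonpath='{.spec.nodeName}'` returns the same value directly.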

Step 2: Login to the node and fix the file corruption issue

Log in to the node (opsramp-gateway) where the pod is running. The corrupted filesystem is repaired with fsck:

fsck -y <file-path> 

To find the device path to pass to fsck, describe the affected pod:

kubectl describe pod nextgen-gw-0 

Sample Response:

Events: 
  Type     Reason       Age                  From     Message 
  ----     ------       ----                 ----     ------- 
  Warning  FailedMount  56s (x5809 over 8d)  kubelet  MountVolume.MountDevice failed for volume "pvc-b3ca140a-dab9-49f6-9f39-063594e58521" : rpc error: code = Internal desc = 'fsck' found errors on device /dev/longhorn/pvc-b3ca140a-dab9-49f6-9f39-063594e58521 but could not correct them: fsck from util-linux 2.39.3 
/dev/longhorn/pvc-b3ca140a-dab9-49f6-9f39-063594e58521 contains a file system with errors, check forced. 
/dev/longhorn/pvc-b3ca140a-dab9-49f6-9f39-063594e58521: Unattached inode 1555 

In this case, the file path is /dev/longhorn/pvc-b3ca140a-dab9-49f6-9f39-063594e58521, so run:

fsck -y /dev/longhorn/pvc-b3ca140a-dab9-49f6-9f39-063594e58521 
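Rather than copying the device path by hand, you can derive it from the event text. A sketch, using the FailedMount message shown above (the `grep` pattern assumes Longhorn's `/dev/longhorn/pvc-<uuid>` naming):

```shell
# Extract the corrupted device path from the FailedMount event text.
event='MountVolume.MountDevice failed for volume "pvc-b3ca140a-dab9-49f6-9f39-063594e58521" : fsck found errors on device /dev/longhorn/pvc-b3ca140a-dab9-49f6-9f39-063594e58521 but could not correct them'
dev=$(echo "$event" | grep -oE '/dev/longhorn/pvc-[0-9a-f-]+' | head -n 1)
echo "$dev"
# On the node, repair it with: fsck -y "$dev"
```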

Step 3: Delete the Affected Pods

To apply the fixes, delete the affected pod so Kubernetes can recreate it:

kubectl delete pod nextgen-gw-0 

If multiple pods are affected, repeat the deletion process for each.
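The repeated deletions can be sketched as a dry-run loop; the pod names are the examples from this guide, and dropping the `echo` would perform the actual deletions.

```shell
# Dry run: print the delete command for each affected pod.
for pod in nextgen-gw-0 nextgen-gw-redis-master-0; do
  echo kubectl delete pod "$pod"
done
```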